University of Konstanz

UKN

VAST 2009 Challenge
Challenge 1: Badge and Network Traffic

Authors and Affiliations:

Dr. Peter Bak, University of Konstanz, bak@dbvis.inf.uni-konstanz.de [PRIMARY contact]
Svenja Leifert, University of Konstanz, svenja.leifert@uni-konstanz.de [author, analyst]
Christoph Granacher, University of Konstanz, christoph.granacher@uni-konstanz.de [author, analyst]


Tool(s):

 KNIME – Konstanz Information Miner

Developed at: University of Konstanz

by: KNIME CORE TEAM

Version 2.0.3

KNIME, pronounced [naim], is a modular data exploration platform that enables the user to visually create data flows (often referred to as pipelines), selectively execute some or all analysis steps, and later investigate the results through interactive views on data and models.

www.knime.org



Java

Version 6.0

Java is an object-oriented programming language.

http://java.sun.com/



Microsoft Excel

Version 2007

Microsoft Excel is a spreadsheet application and part of Microsoft Office. It features calculation, graphing tools and pivot tables.

http://office.microsoft.com/en-us/excel/FX100487621033.aspx





Video:

Video-UKN-MC1.wmv



ANSWERS:


MC1.1: Identify which computer(s) the employee most likely used to send information to his contact in a tab-delimited table which contains for each computer identified: when the information was sent, how much information was sent and where that information was sent.

Traffic.txt




MC1.2:  Characterize the patterns of behavior of suspicious computer use.

The most characteristic trait of suspicious computer use we found is that the guilty employee used the PCs of workmates who were absent from their offices.

Our process is a form of the KDD pipeline (see Picture1) with three main iterative phases. The data needed to be prepared (Data Preparation), then analysed with programs or visual analytics tools (Interaction), after which we could draw conclusions and gather new information (Knowledge).



Picture1: Pipeline



In the Data Preparation phase we computed minutes per day and minutes per month. The proxLog and IPLog data tables were joined into an Overview table containing the employee IDs, types and time components, where "types" corresponds to "Type" in the proxLog dataset and to "Socket" in the IPLog.

Preparing the data took about one hour, as it required many small steps such as splitting strings, changing data types, etc.

In later iterations, this phase consisted only of different filterings or selections of the data (e.g. looking at IDs separately).

Everything except the joining of the proxLog and IPLog data tables was done semi-automatically: the preprocessing steps had to be identified manually and were then carried out by the system on the data tables.
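
For illustration, a minimal Java sketch of this time-component computation. The timestamp format and the reading of "minutes per day"/"minutes per month" as minute-of-day and minute-of-month are assumptions made only for this sketch, not the actual dataset schema:

    import java.text.SimpleDateFormat;
    import java.util.Calendar;

    // Sketch: derive the time components used in the plots below from a
    // raw log timestamp. Format and field semantics are assumptions.
    public class TimeComponents {
        public static void main(String[] args) throws Exception {
            String raw = "2008-01-08 12:56:00"; // hypothetical log entry
            Calendar cal = Calendar.getInstance();
            cal.setTime(new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").parse(raw));

            // Minute within the day ("minutes per day" axis).
            int minuteOfDay = cal.get(Calendar.HOUR_OF_DAY) * 60
                            + cal.get(Calendar.MINUTE);
            // Minute within the month ("minutes per month" axis).
            int minuteOfMonth = (cal.get(Calendar.DAY_OF_MONTH) - 1) * 24 * 60
                              + minuteOfDay;

            System.out.println(minuteOfMonth + "\t" + minuteOfDay);
        }
    }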

We could then begin to search for anomalies.

Plotting minutes per month against minutes per day gives a good overview of each person's data traffic (see Picture2). Colours are mapped to request sizes, with red squares marking the largest amounts. This gave us a few suspicious IDs but no definite results; however, the same IDs reappeared in different situations later.

Another approach was plotting the Overview table for each ID in the minutes per month/minutes per day view with colours mapped to the type (blue=data traffic, green=prox-in-building, red=prox-in-classified, yellow=prox-out-classified; see Picture3). In several cases, a blue square appears between a red and a yellow one, which means the ID's PC was used while its owner was in the classified area.
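
The "blue between red and yellow" pattern can also be checked programmatically. A sketch, assuming a time-sorted event list per employee ID; the Event class and its type labels are our own simplification of the Overview table:

    import java.util.ArrayList;
    import java.util.List;

    // Sketch: flag data traffic on an employee's PC while that employee's
    // badge is inside the classified area ("blue between red and yellow").
    public class ClassifiedTrafficCheck {
        static class Event {
            final long minute;   // minute within the month (assumed encoding)
            final String type;   // "traffic", "prox-in-classified", "prox-out-classified", ...
            Event(long minute, String type) { this.minute = minute; this.type = type; }
        }

        // events: all entries for one employee ID, sorted by time.
        // Returns the traffic events that fall between a classified
        // login and the corresponding logout.
        static List<Event> trafficWhileInClassified(List<Event> events) {
            List<Event> suspicious = new ArrayList<Event>();
            boolean inClassified = false;
            for (Event e : events) {
                if ("prox-in-classified".equals(e.type)) {
                    inClassified = true;
                } else if ("prox-out-classified".equals(e.type)) {
                    inClassified = false;
                } else if ("traffic".equals(e.type) && inClassified) {
                    suspicious.add(e); // PC active while its owner was away
                }
            }
            return suspicious;
        }
    }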

To make sure we had found all suspicious moments, we wrote program 1 to support our manual findings (see first part, first four rows, Picture4). It discovered two IDs that logged into the classified area without logging out later (ID 38 on the 4th at 13:12, ID 49 on the 8th at 12:56). We then wrote program 2, which detected three further exceptions: ID 30 logged out without having logged in before (on the 10th at 10:33, the 17th at 11:31 and the 24th at 9:00).
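
Programs 1 and 2 are not reproduced here; the following sketch only illustrates the kind of check they performed (unmatched classified login/logout badge events), reusing the simplified Event class from the previous sketch:

    import java.util.List;

    // Sketch of the checks behind programs 1 and 2: report a classified
    // login with no later logout (program 1) and a logout with no prior
    // login (program 2). Events are time-sorted, for one employee ID.
    public class UnmatchedClassifiedEvents {
        static void report(List<ClassifiedTrafficCheck.Event> events, int id) {
            int open = 0; // classified logins currently without a logout
            for (ClassifiedTrafficCheck.Event e : events) {
                if ("prox-in-classified".equals(e.type)) {
                    open++;
                } else if ("prox-out-classified".equals(e.type)) {
                    if (open == 0) {
                        System.out.println("ID " + id
                            + ": logout without prior login at minute " + e.minute);
                    } else {
                        open--;
                    }
                }
            }
            if (open > 0) {
                System.out.println("ID " + id
                    + ": login without later logout (" + open + " unmatched)");
            }
        }
    }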

In this part of the process, everything except the detection of anomalies in the plots was done automatically; this phase took about two and a half hours.




Picture2




Picture3



All suspicious data traffic we found had destination IP 100.59.151.133, so probably all traffic to this address is suspicious (see first three columns, Picture4).
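
In our workflow this was a simple row filter; expressed in code, the check reduces to a single comparison (the field name destIp is an assumption):

    // Sketch: a traffic row counts as suspicious if it goes to the
    // conspicuous destination address.
    public class DestinationFilter {
        static boolean isSuspicious(String destIp) {
            return "100.59.151.133".equals(destIp);
        }
    }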

We took a closer look at these occasions.

Looking at the owners of suspicious PCs, their office neighbours and their (probable) behaviour during the times suspicious data traffic occurred (manually, using the minutes per month/minutes per day plots), we found that in most cases the employees' absence could be explained (see Picture4). As the traitor would not want to be detected, he/she would not have used an office where anyone was present. However, on two occasions ID 30 was present and active while his neighbour's PC was used.

Being in the classified area is quite a good alibi, so we manually counted, for all IDs, the cases in which an employee was in the classified area while suspicious data traffic occurred (see upper half, Picture5). Only IDs 27 and 30 never have an alibi, which makes them highly suspicious. Since it is not impossible to sneak into or out of the classified area, however, this did not give us definite results.
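
A sketch of this alibi count in Java; the Interval class (a classified-area stay, in minutes within the month) is our own simplification:

    import java.util.List;

    // Sketch: an employee has an alibi for a suspicious sending if it
    // falls inside one of his/her classified-area stays.
    public class AlibiCount {
        static class Interval {
            final long start, end; // one classified-area stay, in minutes
            Interval(long start, long end) { this.start = start; this.end = end; }
        }

        // classifiedStays: one employee's stays in the classified area;
        // suspiciousMinutes: times of the suspicious sendings.
        static int countAlibis(List<Interval> classifiedStays,
                               List<Long> suspiciousMinutes) {
            int alibis = 0;
            for (long t : suspiciousMinutes) {
                for (Interval stay : classifiedStays) {
                    if (stay.start <= t && t <= stay.end) {
                        alibis++;
                        break; // one alibi per sending is enough
                    }
                }
            }
            return alibis;
        }
    }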

ID 30's behaviour on the 8th and 15th led us to count (again manually) in how many cases each employee had been active within a one- and a two-minute interval around the suspicious data traffic (see lower half, Picture5). ID 30 was extremely active in the two-minute intervals (nearly twice as active as any other employee), and we concluded that he/she had tried to "fake" his/her presence in his/her own office by generating data traffic there shortly after leaking confidential information. But was it possible for ID 30 to know when which office was empty? A look at the office plan reveals that office 15 (IDs 30 and 31) offers a good view over most of the affected offices and of the corridor to the classified area.
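
The window count can be sketched the same way (window = 1 or 2 minutes in our case):

    import java.util.List;

    // Sketch: count the suspicious sendings around which an employee
    // shows activity within +/- "window" minutes.
    public class NearbyActivityCount {
        // activityMinutes: times at which one employee was active;
        // suspiciousMinutes: times of the suspicious sendings.
        static int countNearbyActivity(List<Long> activityMinutes,
                                       List<Long> suspiciousMinutes, int window) {
            int count = 0;
            for (long s : suspiciousMinutes) {
                for (long a : activityMinutes) {
                    if (Math.abs(a - s) <= window) {
                        count++;
                        break; // count each sending at most once
                    }
                }
            }
            return count;
        }
    }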

This took us about two hours.



Picture4



We can now detect clear patterns in ID 30's behaviour.

ID 30's own data traffic plot shows short gaps shortly before each occurrence of suspicious data traffic, but he/she is often active again once the malicious conduct is done. We imagine him/her preparing some kind of data traffic on his/her own PC beforehand.

Apart from that, he/she started slowly with one sending per day, then two, later three. Three of the first five sendings were even carried out from his/her own office, but he/she became more careful later and used different offices.

If one divides the suspicious data traffic as in Picture4, with one group for PCs that were used while their owners were in the classified area and one for the rest, a clear pattern becomes visible. While the events of the first group are spread over the whole day, the others mainly take place in the morning and evening, when many employees have not yet arrived or have already left.

Furthermore, all data was sent on Tuesdays and Thursdays.
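
Checking the weekday of a sending is a one-line computation once the dates are parsed; the dataset's concrete year and month are left as parameters in this sketch rather than hard-coded:

    import java.util.Calendar;
    import java.util.GregorianCalendar;

    // Sketch: does a given date fall on a Tuesday or Thursday?
    public class WeekdayCheck {
        // month is 1-based (1 = January).
        static boolean onTuesdayOrThursday(int year, int month, int day) {
            int dow = new GregorianCalendar(year, month - 1, day)
                          .get(Calendar.DAY_OF_WEEK);
            return dow == Calendar.TUESDAY || dow == Calendar.THURSDAY;
        }
    }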

Here, brief looks at different plots (done manually, less than half an hour in total) were enough to detect these anomalies, while a deeper investigation of ID 30's data traffic compared to the rest did not yield any results beyond those already mentioned: the sendings to 100.59.151.133.




Picture5